Statistical properties of genealogical trees 
Abstract

We analyse the statistical properties of genealogical trees in a neutral model
of a closed population with sexual reproduction and non-overlapping generations.
By reconstructing the genealogy of an individual from the population
evolution, we measure the distribution of ancestors appearing more than once
in a given tree. After a transient time, the probability of repetition follows,
up to a rescaling, a stationary distribution which we calculate both numerically
and analytically. This distribution exhibits a universal shape with a
non-trivial power law which can be understood by an exact, though simple,
renormalization calculation. Some real data on human genealogy illustrate
the problem, which is relevant to the study of the real degree of diversity in
closed interbreeding communities.




Modern man appeared on Earth some $10^5$ years ago [1,2]. At that time, a few social groups
totalling several thousand individuals occupied small settlements, most probably
in Africa [3]. Nowadays, there are about $5 \times 10^9$ human beings on Earth, whose
lineages could in principle be traced back to that time. Each human being has two parents,
four grandparents, and in general $2^n$ ancestors in the $n$-th upper generation. Going
backwards in time to the first group of anatomically modern Homo sapiens, some 4000
generations ago, we should find $2^{4000} \approx 10^{1200}$ ancestors in each genealogical tree. However,
the total human population at those early times was probably only a few thousand! The
answer to this apparent paradox is simple: a given individual appears more than once in a
genealogical tree [4], even in very distant branches, indicating that many of the ancestors
were in fact close relatives. A repeated individual generates a whole repeated branch in the
tree, and the further we move into the past, the more frequent the repetitions become. This
is the result of mating inside a finite population, the size of which sets an upper bound on
the number of distinct ancestors of a given individual.
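
The arithmetic behind this paradox is easy to check. The short Python sketch below compares the naive number $2^n$ of ancestor slots with a population ceiling. The value of 5000 individuals is only an illustrative stand-in for "a few thousand" and is not taken from any data.

```python
import math

GENERATIONS = 4000   # generations back to early Homo sapiens (from the text)
POPULATION = 5000    # illustrative stand-in for "a few thousand" (assumption)

# Order of magnitude of 2**4000, computed with logarithms.
log10_slots = GENERATIONS * math.log10(2)

# First generation at which the number of ancestor slots exceeds the
# population, so that repeated ancestors become unavoidable.
first_forced = math.ceil(math.log2(POPULATION))

print(f"2^{GENERATIONS} is about 10^{log10_slots:.0f}")
print(f"repetitions are forced from generation {first_forced} onwards")
```

With these illustrative numbers, repetitions are unavoidable after little more than a dozen generations, long before the 4000 generations considered in the text.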

These repetitions are particularly apparent in a small closed interbreeding
population. Royal genealogy provides a nice example, since nobles usually
married within their own caste. As an illustration of the problem, we have analysed the
repetitions in the genealogical tree of the English king Edward III (1312-1377) [5]. It
contains almost 10'^ individuals, some of which appear more than once (and up to six times) in
his tree. We show in Fig. 1 the function $F(r)$, defined as the ratio of the number
$M(r)$ of ancestors which appear $r$ times in the tree to the total number $N_t$ of different
ancestors, $F(r) = M(r)/N_t$.

We study here the statistics of repetitions in genealogical trees as a function of the
population size and of the generation in the past that we are looking at. The question that we
are addressing can be put in the more general context of genetic diversity [6,7]. In fact, an
important factor in the variability of natural populations is the diversity displayed, in the
genealogical history of every individual, by his ancestors themselves, and by their weights
in the present genome. Here we calculate these weights in a simple neutral model, with no
selection, no change in the population size and no geographical isolation. Possible effects of
these on genealogies and genetic diversities are discussed in [8,9].

We started by performing numerical simulations of a simple neutral model of a
closed population evolving under sexual reproduction with non-overlapping generations. In
our model the population size is fixed at $N$ for all generations. The population is equally
divided into two groups, representing males and females. At every generation, we form
heterosexual pairs at random and assign them a number of descendants distributed according
to a Poisson distribution. This is done by choosing for each male or each female a pair of
parents at random in the previous generation [10]. After a number $G$ of generations, the
tree of each of the individuals in the youngest generation is reconstructed.
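
A minimal Python sketch of this procedure is given below. It is one possible reading of the model, not the authors' code: couples are assumed to be fixed within a generation, every individual draws its parental couple uniformly at random, and the tree of a single individual (labelled 0) is followed backwards. The function name `ancestor_counts` and the parameter values are hypothetical.

```python
import random

def ancestor_counts(N, G, seed=0):
    """Return a list whose entry g-1 holds, for past generation g, how many
    times each of the N individuals appears in the tree of individual 0."""
    rng = random.Random(seed)
    half = N // 2                 # number of couples (N is assumed even)
    counts = [0] * N
    counts[0] = 1                 # the tree is rooted at individual 0
    history = []
    for _ in range(G):
        per_couple = [0] * half
        for c in counts:
            # every individual draws its parental couple uniformly at random
            per_couple[rng.randrange(half)] += c
        # the male and the female of a couple inherit the same number of
        # appearances: indices 0..half-1 are males, half..N-1 are females
        counts = per_couple + per_couple
        history.append(counts)
    return history

history = ancestor_counts(N=1024, G=12)
for g, counts in enumerate(history, start=1):
    distinct = sum(1 for c in counts if c > 0)
    print(g, sum(counts), distinct)   # generation, tree slots, distinct ancestors
```

The number of tree slots doubles at every generation, while the number of distinct ancestors saturates once $2^{n_g}$ becomes comparable to $N$.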

We first calculated the distribution $F(r)$ of repetitions in this model for populations
of N = 2^^ and N = 2^^ individuals. This might be a rough estimate of the number of noble
people at the time of Edward III. After $G = 10$ generations, we compute the probability
of repetitions in the whole tree (notice that in the real world generations often overlap
and thus the same person might be found in different generations; this possibility is absent
in our model). The result of our simulations is compared with the real data in
Figure 1. We observe an acceptable agreement, although the distribution
$F(r)$ depends rather strongly on $G$ and $N$, and the agreement is often worse for other
reasonable choices of these two parameters.

We have also measured the probability of repetitions $H(r, n_g)$ at every past generation
$n_g = 1, \ldots, G$, that is, the probability that any individual at generation $n_g$ in the past
appears $r$ times in the tree of an individual at generation 0 ($n_g = 1$ corresponds to the
parents, $n_g = 2$ to the grandparents and so on; note that $H(0, n_g)$ is simply the probability
that an individual is not present in a tree after $n_g$ generations). In the first few generations
(parents, grandparents...), if the population size $N$ is large, the probability of finding an
individual more than once in a tree is very small. As a consequence, $H(r, n_g)$ decreases with
$r$ when $n_g$ is small. Going further into the past, at some point two "brothers" will appear in
the tree of an individual, and from then on their two branches coincide. Beyond that point,
more and more repetitions occur.

The distribution $H(r, n_g)$ is shown in Figure 2. It changes its shape during a transient
period of the order of $\log N$ generations. (Note an important difference between $F(r)$,
shown in Figure 1, and $H(r, n_g)$: for $F$ we counted only those individuals present in
one particular genealogical tree, whereas for $H$ we count the whole population at generation
$n_g$ in the past.) Since the $2^{n_g}$ positions of the tree at generation $n_g$ are shared among
$N$ individuals, we have $\sum_{r \ge 0} H(r, n_g) = 1$ and $\sum_{r \ge 0} r\, H(r, n_g) = 2^{n_g}/N$. Figure 2
shows, for N = 2^^, the function $H(r, n_g)$ for different generations before and after reaching
the stationary shape. For $n_g$ small, $H(r, n_g)$ decreases with $r$, meaning that repetitions are
very improbable. As $n_g$ increases, the number of repetitions increases and $H(r, n_g)$ exhibits
a maximum and a shape which becomes stationary.
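
The two sum rules for $H(r, n_g)$ (normalisation to 1, and a mean number of appearances equal to $2^{n_g}/N$) can be checked on a toy example. The sketch below tabulates $H$ from a list of repetition counts. The list itself is invented purely for illustration, and the helper name `repetition_distribution` is hypothetical.

```python
from collections import Counter
from fractions import Fraction

def repetition_distribution(counts):
    """H(r) = fraction of the population that appears exactly r times."""
    N = len(counts)
    return {r: Fraction(m, N) for r, m in sorted(Counter(counts).items())}

# Toy generation n_g = 3 in a population of N = 6: the tree has 2**3 = 8 slots.
counts = [0, 0, 1, 1, 2, 4]
H = repetition_distribution(counts)

total = sum(H.values())                      # normalisation of H
mean_r = sum(r * h for r, h in H.items())    # mean number of appearances
print(H, total, mean_r)
```

Exact fractions are used so that both sum rules hold without rounding error.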

If we rescale the distribution $H(r, n_g)$ by plotting, as in Figure 3, the distribution

$$P(w) = \frac{2^{n_g}}{N}\, H(r, n_g) \tag{1}$$

versus the weight

$$w = \frac{r N}{2^{n_g}} \tag{2}$$

all the distributions of Figure 2 (after a transient period) collapse onto a single stationary
function. Figure 3 shows the function $P(w)$ for several values of $n_g$ after the transient
period, obtained for a population of $N = 2^{20}$ individuals. We observe that the left tail of $P(w)$
is a power law, $P(w) \sim w^\beta$, and a least-squares fit to our numerical results in the domain
$w \in$ (10~^, 10^^) returns $\beta \approx 0.302$. In addition to the exponent $\beta$, one can accurately
measure the moments $\langle w^n \rangle = \int w^n P(w)\, dw$ of $P(w)$ as well as the fraction $S(n_g)$ of the
total population in the oldest generation which is absent from a given genealogical tree.
Figure 4 contains our numerical estimates for $S(n_g)$, together with the first moments
of the distribution $P(w)$. As can be seen, even when the number of potential ancestors in
the tree is much larger than the number of individuals in the population, not all of them
contribute to the present. In fact, the proportion of individuals without descendants
reaches a fixed value, $S(n_g \to \infty) \approx 0.2031$.
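
The rescaling and the plateau of $S(n_g)$ can be explored with a small simulation. The sketch below is a rough illustration under stated assumptions (fixed couples within a generation, a single tree, and a much smaller population than in the figures); the name `rescaled_weights`, the seed and the parameter values are hypothetical.

```python
import random

def rescaled_weights(N, G, seed=1):
    """Weights w = r N / 2**G of all N individuals, G generations back,
    in the tree of a single individual of generation 0."""
    rng = random.Random(seed)
    half = N // 2                  # number of couples (N is assumed even)
    counts = [0] * N
    counts[0] = 1
    for _ in range(G):
        per_couple = [0] * half
        for c in counts:
            per_couple[rng.randrange(half)] += c
        counts = per_couple + per_couple
    return [c * N / 2 ** G for c in counts]

N, G = 4096, 40
w = rescaled_weights(N, G)
absent = sum(1 for x in w if x == 0) / N    # estimate of S(n_g) at n_g = G
mean_w = sum(w) / N                         # equals 1 by construction
second = sum(x * x for x in w) / N          # rough estimate of <w^2>
print(absent, mean_w, second)
```

For a population this small the estimate of $S$ fluctuates from run to run, so it should only be expected to lie in the neighbourhood of the asymptotic value quoted in the text.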

The distribution $P(w)$ can be understood analytically by the following argument: if we
consider the genealogical tree of an individual, say individual $i = 1$ at the 0th generation,
the weights $w$ of his ancestors can be traced back according to the following algorithm. From
(2) we have $w_i(0) = N$ for $i = 1$ and $w_i(0) = 0$ for $i \neq 1$. Then the weights of the ancestors
at generation $n_g + 1$ in the past, with $0 \le n_g \le G - 1$, are given by

$$w_i(n_g + 1) = \frac{1}{2} \sum_{j\ \text{children of}\ i} w_j(n_g) \tag{3}$$

When $N$ is large, the probability $p_k$ that an individual at generation $n_g + 1$ in the past
has $k$ children at generation $n_g$ becomes a Poisson distribution

$$p_k = \frac{2^k e^{-2}}{k!} \tag{4}$$
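
The Poisson limit can be checked numerically. In the sketch below it is assumed that each of the $N$ individuals of a generation picks one of the $N/2$ parental couples uniformly at random, so that the number of children of a given couple is binomial with $N$ trials and success probability $2/N$. The function names are hypothetical.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of k successes in n independent trials of probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, mean=2.0):
    """Poisson probability with the given mean (mean 2 is equation (4))."""
    return mean ** k * math.exp(-mean) / math.factorial(k)

N = 10_000
max_gap = max(abs(binomial_pmf(k, N, 2 / N) - poisson_pmf(k)) for k in range(15))
mean_children = sum(k * poisson_pmf(k) for k in range(60))

for k in range(6):
    print(k, round(binomial_pmf(k, N, 2 / N), 6), round(poisson_pmf(k), 6))
print(max_gap, mean_children)
```

The gap between the two distributions shrinks as $N$ grows, and the mean number of children is 2, as required for a population of constant size.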

Now if for large $N$ we consider that the weights of the children of any given individual
are uncorrelated (this can be viewed as an approximation, but in fact, by calculating pair
correlations between the weights in our model, one can show that for large enough $N$ this
approximation becomes exact), we obtain from (3) that any weight at generation $n_g + 1$ is
half the sum of $k$ i.i.d. weights at generation $n_g$, with $k$ itself randomly chosen according to (4).
Then if $g_{n_g}(\lambda)$ is the generating function of the weights at generation $n_g$ in the past,

$$g_{n_g}(\lambda) = \left\langle e^{\lambda w_i(n_g)} \right\rangle$$

it follows from (3) (and the fact that for large $N$ the $w_j(n_g)$ are uncorrelated) that it satisfies

$$g_{n_g+1}(\lambda) = \sum_{k=0}^{\infty} p_k \left[ g_{n_g}(\lambda/2) \right]^k = \exp\left[ -2 + 2\, g_{n_g}(\lambda/2) \right] \tag{5}$$

This recursion has the form of a renormalization group transformation. Together with the
initial condition

$$g_0(\lambda) = 1 + \frac{e^{\lambda N} - 1}{N} \tag{6}$$

it determines all the generating functions $g_{n_g}(\lambda)$. When $n_g \to \infty$, the generating function
converges to a limit $g(\lambda)$, solution of

$$g(\lambda) = e^{2 g(\lambda/2) - 2} \tag{7}$$
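
One way to evaluate the limiting generating function numerically for $\lambda \le 0$ is to iterate the map $g \to \exp(2g - 2)$ while halving $\lambda$ at each step. The sketch below assumes that, deep enough in the recursion, the initial condition (6) can be replaced by its first-order form $1 + \lambda/2^{m}$, which holds when $2^{m} \gg N|\lambda|$. The function name `limit_generating_function` and the depth of 80 iterations are illustrative choices.

```python
import math

def limit_generating_function(lam, depth=80):
    """Approximate the fixed-point generating function at lam <= 0.

    The iteration is written for delta = g - 1, so that the map
    g -> exp(2 g - 2) becomes delta -> expm1(2 delta); this avoids the loss
    of precision that would occur when g is extremely close to 1."""
    delta = lam / 2 ** depth           # first-order initial condition
    for _ in range(depth):
        delta = math.expm1(2 * delta)
    return 1 + delta

for lam in (-0.1, -1.0, -10.0, -1e9):
    print(lam, limit_generating_function(lam))
```

For small $|\lambda|$ the result stays close to $1 + \lambda$, and for very negative $\lambda$ it approaches the probability that a weight vanishes.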

All the information on the shape of the stationary solution $P(w)$ is contained in the
solution of (7). For example, one can expand the solution $g(\lambda)$ of (7) in a power series and find
that

$$g(\lambda) = 1 + \lambda + \lambda^2 + \frac{8}{9}\lambda^3 + \frac{46}{63}\lambda^4 + \frac{2672}{4725}\lambda^5 + \frac{183712}{439425}\lambda^6 + \ldots$$

Since the coefficient of $\lambda^n$ is $\langle w^n \rangle / n!$, this leads to $\langle w \rangle = 1$, $\langle w^2 \rangle = 2$, $\langle w^3 \rangle = 16/3$, $\langle w^4 \rangle = 368/21$ and so on. (Note that $\langle w \rangle = 1$
is not determined by (7) but is an immediate consequence of the initial condition
(6).) One can also determine the fraction $S$ of individuals with no descendants (that is, the
probability that $w = 0$) by $S = g(-\infty)$. Clearly, $S = g(-\infty)$ is the solution of

$$S = e^{2S - 2}$$

and this gives $S = 0.20318787\ldots$
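
Both the series coefficients and the value of $S$ can be reproduced in a few lines. The sketch below is one possible way of doing so, not necessarily the authors' method: differentiating the fixed-point equation gives $g' = h' g$ with $h(\lambda) = 2 g(\lambda/2) - 2$, which yields a recursion for the coefficients $c_n$ of $\lambda^n$, and $S$ is obtained by iterating $s \to e^{2s-2}$ from $s = 0$. The function names are hypothetical.

```python
from fractions import Fraction
from math import exp, factorial

def series_coefficients(order):
    """Coefficients c_n of lam**n in the solution of g(lam) = exp(2 g(lam/2) - 2)."""
    c = [Fraction(1), Fraction(1)]   # c_0 = g(0) = 1; c_1 = <w> = 1 is fixed by (6)
    for n in range(2, order + 1):
        # h_j = coefficient of lam**j in h(lam) = 2 g(lam/2) - 2
        h = [Fraction(0)] + [c[j] * Fraction(2, 2 ** j) for j in range(1, n)]
        # g' = h' g  gives  n c_n (1 - 2**(1-n)) = sum_{j=1}^{n-1} j h_j c_{n-j}
        rhs = sum(j * h[j] * c[n - j] for j in range(1, n))
        c.append(rhs / (n * (1 - Fraction(2, 2 ** n))))
    return c

def absent_fraction(iterations=200):
    """Smallest solution of s = exp(2 s - 2), found by fixed-point iteration."""
    s = 0.0
    for _ in range(iterations):
        s = exp(2 * s - 2)
    return s

coeffs = series_coefficients(6)
moments = [factorial(n) * cn for n, cn in enumerate(coeffs)]
S = absent_fraction()
print(coeffs)
print(moments)
print(S)
```

Exact rational arithmetic keeps the coefficients in closed form, so they can be compared directly with the fractions quoted in the text.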

The power law $P(w) \sim w^\beta$ at small $w$ can also be easily understood from (7). If
$P(w) \simeq A w^\beta$ for small $w$, one can write that as $\lambda \to -\infty$

$$g(\lambda) - S \simeq A\, \Gamma(\beta + 1)\, |\lambda|^{-\beta - 1}$$

and equation (7) gives (by the standard renormalization argument used to calculate exponents
by linearization around a fixed point, which consists in matching
the singularities proportional to $|\lambda|^{-\beta-1}$ on both sides of (7)) that

$$\beta = -\frac{\log S}{\log 2} - 2 = 0.2991138\ldots$$

in excellent agreement with the results of the simulation. Other properties of the stationary
distribution $P(w)$ could in principle be extracted from (7), but this would require more
complicated mathematical developments.
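
The linearization step can be made explicit and checked numerically. Writing $g = S + \delta$ in the fixed-point equation gives, to first order, $\delta(\lambda) \simeq 2S\, \delta(\lambda/2)$; inserting $\delta \propto |\lambda|^{-\beta-1}$ then requires $1 = 2S \cdot 2^{\beta+1}$. The sketch below evaluates this condition, with $S$ obtained by simple fixed-point iteration (an illustrative choice of method).

```python
import math

def absent_fraction(iterations=200):
    """Smallest solution of s = exp(2 s - 2), found by fixed-point iteration."""
    s = 0.0
    for _ in range(iterations):
        s = math.exp(2 * s - 2)
    return s

S = absent_fraction()

# Matching the |lam|**(-beta-1) singularities on both sides of the linearized
# equation gives 1 = 2 S * 2**(beta + 1), i.e. beta = -log(S)/log(2) - 2.
beta = -math.log(S) / math.log(2) - 2
matching = 2 * S * 2 ** (beta + 1)

print(S, beta, matching)
```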

In this work, we have shown that a simple neutral model of sexual reproduction with 
non-overlapping generations leads to a universal distribution of the weights of ancestors in 
genealogical trees. This universal distribution (more precisely its generating function) is the 
fixed point (7) of a simple renormalization equation (5). The exponent $\beta$ of the power law
observed for small weights can be calculated exactly. 

Our main result is that if we go very far in the past, about 80% of the (adult) population 
appears in the genealogical tree of every individual. If the weights of these ancestors represent 
how often they appear in this tree, these weights have a stationary probability distribution 
which is universal (i.e. independent of the generation and of the population size). 

There are a number of extensions of the present work which, in our opinion, are worth 
pursuing. First, a more complete description of $P(w)$, in particular the large-$w$ behavior,
could be extracted from (7). If we wish to obtain a better approximation to real genealogy,
the possibility of overlaps between generations or of changes in the population size should 
be included. One can try to measure the distribution of lengths of segments in simple 
models [0,|n|] for the evolution of chromosomes to see whether a power law in the length 
distribution is present there too. One could also investigate how our results would be
modified by choosing, instead of (4), a non-Poissonian distribution of offspring. Lastly, it
would be interesting to consider the genealogical trees of several individuals to see how the
repetitions in different trees are correlated [12].



With a little more imagination, one can construct other universality classes, by allowing
the number $p$ of parents of each individual to be arbitrary, instead of $p = 2$ as in our earthly
world. For general $p$, the fixed point equation (7) would become $g(\lambda) = \exp[-p + p\, g(\lambda/p)]$.
Needless to say, one might then try to expand the distribution $P(w)$, the fixed point
$S$ or the exponent $\beta$ in powers of $\epsilon$ for $p = 1 + \epsilon$. In fact, one can show [12] that the
case of an exponentially increasing (or decreasing) population size with $p = 2$ parents for
each individual is equivalent, as far as $g(\lambda)$ is concerned, to the case of a population of
constant size with a number of parents $p$ which depends on the exponential growth rate of
the population.
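
As a rough numerical companion to this generalisation, the sketch below solves $S = \exp[-p + pS]$ for a few values of $p > 1$. The expression used for the exponent, $\beta = -\log S / \log p - 2$, is not stated in the text: it is an assumption obtained by repeating the linearization argument of the $p = 2$ case, which gives the matching condition $1 = pS \cdot p^{\beta+1}$. The function name is hypothetical.

```python
import math

def fixed_point_and_exponent(p, iterations=10_000):
    """Return (S, beta) for p parents per individual, with p > 1.

    S is the smallest solution of S = exp(p (S - 1)); beta follows from the
    assumed matching condition 1 = p S p**(beta + 1)."""
    if p <= 1:
        raise ValueError("a non-trivial fixed point S < 1 requires p > 1")
    s = 0.0
    for _ in range(iterations):
        s = math.exp(p * (s - 1))
    beta = -math.log(s) / math.log(p) - 2
    return s, beta

S2, beta2 = fixed_point_and_exponent(2)
S3, beta3 = fixed_point_and_exponent(3)
for p in (1.5, 2, 3):
    print(p, fixed_point_and_exponent(p))
```

For $p = 2$ this reduces to the values of $S$ and $\beta$ quoted earlier, which is the only case confirmed by the text.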

Apart from the potential application of our results to population genetics and evolutionary
biology, the model of evolution studied here is connected to a number of problems
of current interest in physics. First, the random assignment of the parents of individuals
at each generation is very reminiscent of a problem of distribution of stresses introduced
recently [13] to describe granular materials, with a recursion similar to (3). Graphs which
locally look like trees but where large loops -responsible for cooperative effects- are present 
have attracted a lot of interest in the theory of disordered systems (spin glasses, localization) 
0- 0, [0 



Lastly, the model studied here gives through (5,7) a very simple and pedagogical example
of a problem with a non-trivial exponent, which can be solved exactly by a discrete
renormalization transformation. One could try to see whether the oscillations pO[ which 
usually accompany such discrete renormalization transformations are present here too. 
FIGURE CAPTION

1. Probability of ancestor repetitions in the genealogical tree of King Edward III [5].
The continuous and dashed lines represent the results of simulations of $F(r)$ in a closed
population with 2^^ and 2^^ individuals for our model. Averages have been performed
over the first 10 generations of 10^ independent trees.

2. Distribution $H(r, n_g)$ of $r$ repetitions after $n_g$ generations ($H(0, n_g)$ is not shown). The
distribution changes after roughly $\log N$ generations from a decreasing function of $r$ to a
distribution with a maximum. The generations shown are $n_g$ = 9, 13, 15, 17, 19, 21,
and 23 for a population with N = 2^^. We have averaged over 100 independent runs.

3. Data collapse for the rescaled distribution of repetitions $P(w)$ after the transient
period. Averages have been performed over 10^ independent trees for a population size
$N = 2^{20}$.

4. Dependence of $S(n_g)$ on the generation $n_g$ for a population with N = 2^^. The
numerical asymptotic value is $S(n_g \to \infty) \approx 0.2031$. The bold dotted line is the
predicted theoretical value $S = g(-\infty) = 0.20318787\ldots$. In the inset, we represent
the first ten moments $\langle w^n \rangle$ of the distribution $P(w)$. The continuous line corresponds
to numerical results, while solid circles stand for the theoretical predictions.